# Large-scale pretraining
- **Siglip2 Large Patch16 512** · google · Apache-2.0 · Text-to-Image · Transformers · 4,416 downloads · 8 likes
  SigLIP 2 improves on SigLIP by integrating multiple techniques to strengthen semantic understanding, localization, and dense feature extraction.
- **Wav2vec2 Large Xls R 300m Ru** · NLPVladimir · Apache-2.0 · Speech Recognition · Transformers · 56 downloads · 1 like
  A Russian automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-xls-r-300m on the common_voice_17_0 dataset, achieving a word error rate (WER) of 0.195.
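A WER of 0.195 means roughly one reference word in five requires an edit. For readers unfamiliar with the metric, here is a minimal sketch of how WER is computed (word-level Levenshtein distance divided by reference length; the example sentences are made up):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # → 0.333…
```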
- **CLIP ViT H 14 Laion2b S32b B79k** · ModelsLab · MIT · Text-to-Image · 132 downloads · 0 likes
  A vision-language model trained with the OpenCLIP framework on the LAION-2B English subset, excelling at zero-shot image classification and cross-modal retrieval.
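Zero-shot classification in CLIP-style models reduces to comparing an image embedding against text embeddings of candidate labels ("a photo of a {label}"). A minimal sketch of that scoring step, using random unit vectors as stand-ins for real OpenCLIP embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for real CLIP embeddings: one image vector and one text
# vector per candidate label (e.g. cat / dog / car).
image_emb = normalize(rng.normal(size=(1, 512)))
text_embs = normalize(rng.normal(size=(3, 512)))

# Cosine similarity (dot product of unit vectors), temperature-scaled
# and softmaxed into label probabilities.
logits = 100.0 * image_emb @ text_embs.T
logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(probs.round(3))  # one probability per candidate label, summing to 1
```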
- **CLIP ViT B 32 Laion2b S34b B79k** · recallapp · MIT · Text-to-Image · 17 downloads · 0 likes
  A vision-language model trained with the OpenCLIP framework on the LAION-2B English dataset, supporting zero-shot image classification and cross-modal retrieval.
- **Aimv2 1b Patch14 224.apple Pt** · timm · Image Classification · Transformers · 198 downloads · 0 likes
  AIM-v2 is a 1-billion-parameter image encoder built on the timm library, suited to image feature extraction tasks.
- **Eva Giant Patch14 Clip 224.laion400m** · timm · MIT · Text-to-Image · 124 downloads · 0 likes
  An EVA CLIP vision-language model built on OpenCLIP and the timm framework, supporting zero-shot image classification.
- **Eva02 Large Patch14 Clip 224.merged2b** · timm · MIT · Image Classification · 165 downloads · 0 likes
  An EVA CLIP vision-language model built on OpenCLIP and timm model weights, supporting tasks such as zero-shot image classification.
- **Eva02 Enormous Patch14 Clip 224.laion2b Plus** · timm · MIT · Text-to-Image · 54 downloads · 0 likes
  EVA-CLIP is a large-scale vision-language model based on the CLIP architecture, supporting tasks such as zero-shot image classification.
- **Vit Large Patch14 Clip 224.dfn2b** · timm · Other · Image Classification · Transformers · 178 downloads · 0 likes
  A vision transformer based on the CLIP architecture, released by Apple and focused on image feature extraction.
- **Seamless M4t V2 Large Speech Encoder** · WueNLP · Audio Classification · Transformers · Multilingual · 67 downloads · 3 likes
  The speech encoder module extracted from SeamlessM4Tv2-Large, excelling at cross-lingual and multilingual sequence-level audio classification.
- **Vit Gigantic Patch14 Clip 224.metaclip 2pt5b** · timm · Image Classification · 444 downloads · 0 likes
  A vision model trained on the MetaCLIP-2.5B dataset, compatible with both the OpenCLIP and timm frameworks.
- **Qwen2 Audio 7B** · Qwen · Apache-2.0 · Audio-to-Text · Transformers · English · 28.26k downloads · 114 likes
  Qwen2-Audio is the Tongyi Qianwen large audio-language model series, supporting both voice-chat and audio-analysis interaction modes.
- **CLIP ViT B 32 Laion2b S34b B79k** · rroset · MIT · Text-to-Image · 48 downloads · 0 likes
  A CLIP ViT-B/32 model trained with the OpenCLIP framework on the LAION-2B dataset, supporting zero-shot image classification and cross-modal retrieval.
- **Owsm Ctc V3.1 1B** · espnet · Speech Recognition · Other · 116 downloads · 13 likes
  OWSM-CTC is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC, supporting multilingual speech recognition, speech translation, and language identification.
- **Chronos T5 Large** · amazon · Apache-2.0 · Time Series Forecasting · Transformers · 156.60k downloads · 139 likes
  Chronos is a family of pretrained time-series forecasting models built on language-model architectures: series are scaled and quantized into token sequences for training, enabling probabilistic forecasting.
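The scale-and-quantize idea above can be sketched in a few lines. This is an illustrative toy, not the actual Chronos tokenizer: the bin count, bin range, and mean-absolute scaling here are assumptions for the sketch.

```python
import numpy as np

def tokenize(series, n_bins=100, low=-5.0, high=5.0):
    """Scale a series by its mean absolute value, then map each value to
    one of n_bins uniform bins so it can be fed to a language model."""
    series = np.asarray(series, dtype=float)
    scale = np.abs(series).mean() or 1.0   # avoid dividing by zero
    scaled = series / scale
    edges = np.linspace(low, high, n_bins + 1)
    # np.digitize returns 1-based bin indices; shift and clip to [0, n_bins)
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)
    return tokens, scale

def detokenize(tokens, scale, n_bins=100, low=-5.0, high=5.0):
    """Approximately invert tokenization using each bin's center."""
    width = (high - low) / n_bins
    centers = low + width * (np.asarray(tokens) + 0.5)
    return centers * scale

tokens, scale = tokenize([10.0, 12.0, 9.0, 11.0])
approx = detokenize(tokens, scale)
print(tokens, approx)  # round-trip error bounded by half a bin width * scale
```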
- **Whisper Large V3 Ft Cv16 Mn** · sanchit-gandhi · Apache-2.0 · Speech Recognition · Transformers · 34 downloads · 1 like
  A speech recognition model fine-tuned from OpenAI Whisper Large V3 on the Common Voice 16.0 dataset.
- **W2v Bert 2.0** · facebook · MIT · Speech Recognition · Transformers · Multilingual · 477.05k downloads · 170 likes
  A Conformer-based speech encoder pretrained on 4.5 million hours of unlabeled audio, covering more than 143 languages.
- **Sentence Camembert Large** · Lajavaness · Apache-2.0 · Text Embedding · French · 3,729 downloads · 8 likes
  A French sentence-embedding model based on CamemBERT-large, providing strong semantic search capabilities.
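Semantic search with sentence embeddings amounts to ranking corpus vectors by cosine similarity to a query vector. A toy sketch of that ranking step, with random vectors standing in for real Sentence-CamemBERT outputs (the corpus strings are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for sentence embeddings: a real model would produce one
# vector per corpus sentence and one for the query.
corpus = ["phrase A", "phrase B", "phrase C"]
corpus_embs = rng.normal(size=(3, 768))
# Construct a query vector deliberately close to "phrase B"'s embedding.
query_emb = corpus_embs[1] + 0.1 * rng.normal(size=768)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Semantic search: score every corpus sentence against the query
# and return the best match.
scores = [cosine(query_emb, e) for e in corpus_embs]
best = corpus[int(np.argmax(scores))]
print(best)  # → phrase B
```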
- **Vit H 14 CLIPA 336 Laion2b** · UCSC-VLAA · Apache-2.0 · Text-to-Image · 74 downloads · 4 likes
  A CLIPA-v2 model trained on the laion2B-en dataset, focused on zero-shot image classification.
- **Metaclip L14 Fullcc2.5b** · facebook · Text-to-Image · Transformers · 172 downloads · 3 likes
  MetaCLIP is a large-scale vision-language model trained on 2.5 billion data points from CommonCrawl (CC), revealing CLIP's data-curation methodology.
- **CLIP ViT B 32 DataComp.XL S13b B90k** · laion · MIT · Text-to-Image · 12.12k downloads · 4 likes
  A CLIP ViT-B/32 model trained on the DataComp-1B dataset, designed for tasks such as zero-shot image classification and image-text retrieval.
- **Ro Bart Large 512** · Iulian277 · Large Language Model · Transformers · Other · 141 downloads · 0 likes
  A BART large model with 400 million parameters, pretrained from scratch specifically for Romanian.
- **Pile T5 Large** · EleutherAI · Large Language Model · Transformers · English · 112 downloads · 15 likes
  Pile-T5 Large is an encoder-decoder model trained on The Pile with the T5x library, used primarily for English text-to-text generation.
- **Dinov2 Giant** · facebook · Apache-2.0 · Image Classification · Transformers · 117.56k downloads · 41 likes
  A vision Transformer trained with the DINOv2 method for self-supervised image feature extraction.
- **Idefics 9b** · HuggingFaceM4 · Other · Image-to-Text · Transformers · English · 3,676 downloads · 46 likes
  IDEFICS is an open-source multimodal model that takes image and text inputs and generates text outputs, serving as an open reproduction of DeepMind's Flamingo.
- **CLIP ViT B 32 Laion2b E16** · justram · MIT · Text-to-Image · 89 downloads · 0 likes
  A vision-language pretrained model implemented with OpenCLIP, supporting zero-shot image classification.
- **CLIP ViT L 14 CommonPool.XL.clip S13b B90k** · laion · MIT · Text-to-Image · 534 downloads · 1 like
  A vision-language model based on the CLIP architecture, supporting zero-shot image classification and cross-modal retrieval.
- **CLIP ViT L 14 CommonPool.XL S13b B90k** · laion · MIT · Text-to-Image · 4,255 downloads · 2 likes
  A vision-language pretrained model based on the CLIP architecture, supporting zero-shot image classification and cross-modal retrieval.
- **CLIP ViT B 16 CommonPool.L.basic S1b B8k** · laion · MIT · Text-to-Image · 57 downloads · 0 likes
  A vision-language model based on the CLIP architecture, supporting zero-shot image classification.
- **CLIP ViT B 32 CommonPool.M.clip S128m B4k** · laion · MIT · Image-to-Text · 164 downloads · 0 likes
  A zero-shot image classification model based on the CLIP architecture, from the CommonPool model series.
- **CLIP ViT B 32 CommonPool.S.laion S13m B4k** · laion · MIT · Text-to-Image · 58 downloads · 0 likes
  A vision-language model based on the CLIP architecture, supporting zero-shot image classification.
- **CLIP ViT B 32 CommonPool.S.image S13m B4k** · laion · MIT · Text-to-Image · 60 downloads · 0 likes
  A vision-language model based on the CLIP architecture, supporting zero-shot image classification.
- **CLIP ViT B 32 CommonPool.S.text S13m B4k** · laion · MIT · Text-to-Image · 57 downloads · 0 likes
  A vision-language model based on the CLIP architecture, supporting zero-shot image classification.
- **Arbertv2** · UBC-NLP · Large Language Model · Transformers · Arabic · 267 downloads · 6 likes
  ARBERTv2 is an upgraded BERT model trained on 243 GB of Modern Standard Arabic (MSA) text comprising 27.8 billion tokens.
- **Eva02 Large Patch14 Clip 224.merged2b S4b B131k** · timm · MIT · Image Classification · 5,696 downloads · 6 likes
  EVA02 is a large-scale vision-language model based on the CLIP architecture, supporting zero-shot image classification.
- **Mt5 Multilingual XLSum Rust** · spursyy · Text Generation · Multilingual · 18 downloads · 3 likes
  An mT5 model fine-tuned on the XL-Sum dataset across 45 languages, designed for multilingual summarization.
- **CLIP ViT B 16 Laion2b S34b B88k** · laion · MIT · Text-to-Image · 251.02k downloads · 33 likes
  A multimodal vision-language model trained with the OpenCLIP framework on the LAION-2B English dataset, supporting zero-shot image classification.
- **Maltberta** · MaCoCu · Large Language Model · Other · 26 downloads · 0 likes
  MaltBERTa is a large-scale language model pretrained on Maltese text using the RoBERTa architecture, developed within the MaCoCu project.
- **XLMR BERTovski** · MaCoCu · Large Language Model · Other · 36 downloads · 0 likes
  A language model pretrained on large-scale Bulgarian and Macedonian text, part of the MaCoCu project.
- **Model Facebookptbrlarge** · Vkt · Apache-2.0 · Speech Recognition · Transformers · 22 downloads · 0 likes
  A Brazilian Portuguese speech recognition model fine-tuned from Facebook's wav2vec2-large-xlsr-53-portuguese on the Common Voice dataset.